ABSTRACT

Positional MEDLINE (PosMed; http://biolod.org/ 
PosMed) is a powerful Semantic Web Association 
Study engine that ranks biomedical resources 
such as genes, metabolites, diseases and drugs, 
based on the statistical significance of associations 
between user-specified phenotypic keywords and 
resources connected directly or inferentially 
through a Semantic Web of biological databases 
such as MEDLINE, OMIM, pathways, co-expres- 
sions, molecular interactions and ontology terms. 
Since 2005, PosMed has been used for in
silico positional cloning studies to infer candidate
disease-responsible genes within chromosomal
intervals. PosMed has now been redesigned as a
workbench for discovering possible functional
interpretations of the numerous genetic variants found
by exome sequencing of human disease
samples. We also show that the association
search engine enhances the value of mouse
bioresources because most knockout mouse resources
have no phenotypic annotation but can
be associated inferentially with phenotypes via genes
and biomedical documents. For this purpose, we
established text-mining rules for the biomedical
documents through careful manual curation, and created
a large number of correct links between genes and
documents. As of May 2013, PosMed associates any phenotypic
keyword with mouse resources using 20 public
databases and four original data sets.



INTRODUCTION 

Mouse bioresources contribute to the study of human 
genes and diseases (1,2). To elucidate the function of all 
mouse genes, the International Knockout Mouse 
Consortium systematically generates mutant embryonic 
stem cells for every protein-coding gene (3), and the 
International Mouse Phenotype Consortium produces 
knockout mice and carries out high-throughput 
phenotyping of each line (4). Including other mouse re- 
sources, >24000 mouse strains are registered in the 
International Mouse Strain Resource (IMSR) (5). To 
enhance the value of bioresources, we applied our 
original statistical search engine called the General and 
Rapid Association Study Engine (GRASE) and provided 
this as a web-oriented service called Positional MEDLINE 
(PosMed) (6-9). PosMed not only allows users to retrieve 
mouse bioresources directly with phenotypic keywords 
described in bioresource annotations, but also inferentially 
through corresponding documents for genes, diseases, 
drugs, ontologies, pathways, metabolites, molecular interactions
and MEDLINE abstracts. With this inferential
association search function, PosMed discovers a wider range of
resources than a simple keyword search and accelerates the
utilization of bioresources, especially those with few
phenotypic annotations. In particular, knockout strains
are not fully used when the targeted gene has an
unknown function and no observed phenotype. PosMed
connects these functionally unknown genes to known
genes using molecular interactions, pathway information
and/or co-citations, and can thereby suggest bioresources
whose phenotypes have not yet been observed as search results.

PosMed is also applicable to the functional interpretation
of genetic variants detected by exome sequencing
studies. When users submit a list of genes and a phenotypic
keyword, PosMed ranks the genes by the statistical relevance
between the keyword and each gene (6-9). These
search functions are implemented using Semantic Web 
Association Study (SWAS) technology. The Semantic 
Web and linked data originally aim to provide a 
common framework that allows data to be shared and 
reused across application, enterprise and community 
boundaries (10,11). For most biological research 
purposes, however, association studies of linked data 
provide more analytical insights into biological systems 
than simple pattern-matching queries of the data (6). To 
take advantage of the Semantic Web of biological linked 
data, we propose extending the methodology of association
studies into what we call a 'Semantic Web
Association Study (SWAS)' (Figure 1). A typical example
of SWAS is the 'Genome-wide Association Study
(GWAS)', which focuses only on the association
between allelic variants and phenotypes in different individuals.
Expanding on the methodology of GWAS, SWAS
explores more distant correlations among genes, func- 
tions, publications, alleles, lines, phenotypes and any 
subset specified by a user's keywords. Because the conven- 
tional Semantic Web Resource Description Framework 
(RDF) (http://www.w3.org/RDF/) and query language 
SPARQL (http://www.w3.org/TR/rdf-sparql-query/) do 
not adequately support statistical evaluation of semantic 
links, we developed the GRASE for the implementation of 
PosMed (6,8). 

General usage of PosMed for bioresources 

PosMed prioritizes genes, bioresources, diseases, metabol- 
ites or drugs depending on the statistical relevance 
between a user's keyword and biological documents. The 
algorithm for computing the P-value is described in our previous
publications (6,7). PosMed provides paths connecting the 
user's keyword to the targeted resources. Figure 2 shows
an example of a two-step inferential (indirect) search result
associating a mouse resource with the keyword 'diabetes'.
Although the mouse strain 'B6.129S6-Gcg<tm1Yhys>'
was not directly annotated with 'diabetes', PosMed suggested
it via the mouse gene 'gcg, glucagon', which has thousands
of documents annotated with 'diabetes'. PosMed
supports inferential searches of up to three steps.
For more examples, such as queries restricted to a specified genomic interval,
please see the tutorial provided on our PosMed
Web site.

Figure 1. Concept of SWAS, which calculates the statistical significance
of associations between any sets of resources connected
through a web of semantic links (solid arrows), whereas GWAS associates
only alleles with phenotypes (dashed arrows).
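
As an illustration of the inferential search described in this section, the following minimal sketch enumerates paths from a keyword to a bioresource through at most three links. It is not the GRASE implementation: the link graph, the node labels (including the shorthand strain name) and the function name find_paths are hypothetical, and the statistical scoring of each link is omitted.

```python
from collections import deque

# Hypothetical link graph: each node maps to the nodes it is linked to.
LINKS = {
    "keyword:diabetes": ["gene:Gcg"],
    "gene:Gcg": ["strain:Gcg-knockout"],
    "strain:Gcg-knockout": [],
}


def find_paths(links, start, is_target, max_steps=3):
    """Return every cycle-free path from start to a target node
    that uses at most max_steps links (breadth-first search)."""
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if len(path) > 1 and is_target(node):
            paths.append(path)
            continue
        if len(path) - 1 == max_steps:
            continue  # step budget used up, do not extend this path
        for neighbour in links.get(node, []):
            if neighbour not in path:  # avoid revisiting a node
                queue.append(path + [neighbour])
    return paths


if __name__ == "__main__":
    for found in find_paths(LINKS, "keyword:diabetes",
                            lambda n: n.startswith("strain:")):
        print(" -> ".join(found))
```

In this toy graph the strain is reached in two steps (keyword to gene, gene to strain), mirroring the 'diabetes' example; with max_steps=1 no path is found, which corresponds to a direct search that misses the strain.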

Advanced search with selection of biological documents 
and search paths 

PosMed provides several options for selecting the
search paths, the documents used for the search and the
search scoring method from the 'expert mode' of the advanced
search setting page (Figure 3). In expert mode,
users can also choose between the statistical
significance association and Boolean association methods to
associate biological items such as genes, chemicals and
bioresources with a user's query, directly or indirectly,
through user-selected search paths. The statistical significance
method associates biological items based on P-values
from Fisher's exact test of the co-occurrence of the linked
items in documents of OMIM, pathways, protein-protein
interaction, gene ontology, phenotype ontology
and other annotations, whereas the Boolean method associates
the linked items co-occurring in the documents
equally, ignoring the degree of significance (6).
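
The difference between the two association methods can be sketched as follows. This is only an illustration under stated assumptions: the exact contingency table and test direction used by PosMed are described in references (6,7), whereas this sketch assumes a one-sided (enrichment) Fisher's exact test on a 2x2 table of document counts. The function names are hypothetical, and the Boolean score of 1.0 for 'no co-occurrence' is an assumption made here for symmetry.

```python
from math import comb


def fisher_exact_enrichment(both, keyword_only, item_only, neither):
    """One-sided Fisher's exact test P-value for co-occurrence.

    The four arguments are the cells of a 2x2 table of document counts:
    documents that hit the keyword and are linked to the item (both),
    only the keyword, only the item, or neither. The P-value is the
    hypergeometric probability of at least `both` co-occurrences.
    """
    total = both + keyword_only + item_only + neither
    keyword_docs = both + keyword_only   # documents hitting the keyword
    item_docs = both + item_only         # documents linked to the item
    max_overlap = min(keyword_docs, item_docs)
    tail = sum(
        comb(keyword_docs, k) * comb(total - keyword_docs, item_docs - k)
        for k in range(both, max_overlap + 1)
    )
    return tail / comb(total, item_docs)


def boolean_association(both):
    """Boolean scoring: any co-occurrence scores 0, regardless of strength.
    Returning 1.0 when there is no co-occurrence is an assumption."""
    return 0.0 if both > 0 else 1.0


if __name__ == "__main__":
    # 3 of 4 keyword documents are linked to the item, out of 8 documents.
    print(fisher_exact_enrichment(3, 1, 1, 3))  # 17/70, about 0.243
    print(boolean_association(3))               # 0.0
```

With the statistical method, items are ranked by how unlikely their co-occurrence with the keyword is by chance; with the Boolean method, every co-occurring item receives the same score.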

PosMed assists functional interpretation after 
exome sequencing 

Exome sequencing studies usually find several hundred to 
several thousand genetic variants by comparing samples 
and controls. To help prioritize these thousands of candidate
genes, PosMed accepts a list of gene IDs together with
the user's descriptions of the gene variants and calculates
a ranking of the genes. The descriptions in the
uploaded file are displayed together with each ranked
gene, helping users to interpret the functional relevance of the
gene variants (Figure 4). Detailed pages for each gene
assist functional interpretation by showing biological 
documents such as MEDLINE, gene annotations, 
OMIM, bioresources, pathway information, molecular 
interactions, ontologies and links to related databases 
(Table 1). 
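
The ranking step for an uploaded gene list can be sketched as below. This is a simplified illustration, not the PosMed code: the record type UploadedGene, the function rank_uploaded_genes, the gene IDs and the P-values are all hypothetical, and the P-values are assumed to have been computed beforehand for the user's keyword.

```python
from dataclasses import dataclass


@dataclass
class UploadedGene:
    gene_id: str
    description: str  # free-text note supplied by the user, e.g. the variant


def rank_uploaded_genes(uploaded, p_values):
    """Sort uploaded genes by ascending P-value and keep the user's
    descriptions attached. Genes with no P-value (no association found
    for the keyword) are placed last."""
    ranked = sorted(
        uploaded,
        key=lambda gene: p_values.get(gene.gene_id, float("inf")),
    )
    return [
        (gene.gene_id, p_values.get(gene.gene_id), gene.description)
        for gene in ranked
    ]


if __name__ == "__main__":
    genes = [
        UploadedGene("gene1", "missense variant"),
        UploadedGene("gene2", "frameshift variant"),
        UploadedGene("gene3", "splice-site variant"),
    ]
    # Hypothetical P-values for one phenotypic keyword.
    scores = {"gene1": 0.02, "gene2": 0.0001}
    for row in rank_uploaded_genes(genes, scores):
        print(row)
```

Keeping the description alongside each ranked gene is what lets the user read the variant annotation and the association evidence side by side.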

Extension of data coverage 

Since our previous publications, we have updated 17 databases, which now
include ~10 million biological documents (Table 1; 7,9).
To enhance the inferential search function for bioresource
retrieval, we newly installed the following three
biological document sets: the mammalian phenotype ontology
(MP), the human disease ontology (DO) and the
International Classification of Diseases (ICD-10). We also
updated the semantic links between the biomedical
document sets. For example, we re-annotated mouse genes to
MEDLINE by defining named entity recognition (NER)
rules to retrieve the correct publications (6-9). For human
genes, we connected genes to publications via their mouse homologs
and newly defined NER rules for the 2249 human genes
that have no mouse homolog. Users can download
these NER rules from our Web site.
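
To give a feel for how a curated NER rule can link a gene to the correct publications, here is a minimal sketch. The rule format is an assumption made for illustration only and is not the format of the downloadable PosMed rules: each hypothetical NerRule holds regular expressions that indicate the gene and expressions that veto a match, and the example abstracts are invented.

```python
import re
from dataclasses import dataclass, field


@dataclass
class NerRule:
    gene_id: str
    include: list                                  # patterns indicating the gene
    exclude: list = field(default_factory=list)    # patterns that veto a match

    def matches(self, text):
        """True if any include pattern is found and no exclude pattern is."""
        if any(re.search(pattern, text) for pattern in self.exclude):
            return False
        return any(re.search(pattern, text) for pattern in self.include)


def annotate(abstracts, rules):
    """Map each document ID to the gene IDs whose rule matches its text."""
    return {
        doc_id: [rule.gene_id for rule in rules if rule.matches(text)]
        for doc_id, text in abstracts.items()
    }


if __name__ == "__main__":
    # Hypothetical rule for an ambiguous gene symbol.
    rules = [NerRule("Cat",
                     include=[r"\bCat\b", r"\b[Cc]atalase\b"],
                     exclude=[r"\bvisual cortex\b"])]
    abstracts = {
        "doc1": "Catalase (Cat) activity was reduced in diabetic mice.",
        "doc2": "Cat visual cortex neurons were recorded.",
    }
    print(annotate(abstracts, rules))
```

The exclusion list is where manual curation would matter most in such a scheme, since short gene symbols often coincide with ordinary words.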

In our previous publications, we used molecular interactions
and co-expression to make links from less
well-annotated biological resources to well-annotated resources.
These relationships are important for revealing more
candidates. On the other hand, PosMed accuracy is
strongly affected by low-quality data. Because omics data
are accumulated with various experimental methods, we
selected high-quality data and removed low-confidence data
such as protein-protein interactions from classical yeast
two-hybrid assays (28).

Figure 2. Example inferential search result followed by direct search results for retrieving a mouse bioresource associated with the keyword
'diabetes'. PosMed shows the path connecting a user's keyword to the resource, a resource description and linked biological documents (B).
To download all candidate mouse strains, click 'check all' at the top of 'Hit resources' and download them as a text file (C).
Panels: (B) search result descriptions; (C) list of candidates.



For Semantic Web compliant data preparation, we used 
RIKENBASE or the RIKEN Scientists' Networking 
System (SciNetS) (29), and public data are downloadable 
through Biophenome Linked Open Databases (BioLOD)
(http://biolod.org). At least once a month we update 



Nucleic Acids Research, 2013, Vol. 41, Web Server issue






Figure 3. A partial example snapshot for 'expert mode'. The upper path (1.) shows direct search with MEDLINE, mammalian phenotype ontology,
mouse bioresources and OMIM documents. The lower path (2.) shows an example inferential path via gene. Users can select the scoring method of
each document from 'strong', 'weak' or 'none' in the menu. The 'strong' scoring method uses a Boolean function and the P-value becomes 0 when
the document has at least one keyword. The 'weak' method computes the P-value using Fisher's exact test. If a user selects 'none', the biological
document is not used (6,7). In this mode users can confirm all PosMed search paths for biological documents.

Figure 4. File upload function and display of users' descriptions. Users can upload an Excel file containing gene IDs and their own descriptions. PosMed
ranks the genes listed in the file by statistical relevance between the user's keyword and each gene, and displays the ranked genes together with
the descriptions uploaded by the user.



Table 1. Updated biological documents for PosMed 2013

Document set                   No. of documents   Data contents
Mouse bioresource              19 280             Mouse strain information registered at IMSR
                               5115               Mouse strain information from the RIKEN BioResource Center
Human gene                     37 287             Gene annotation of HGNC
Mouse gene                     85 726             Gene annotation of MGI
Rat gene                       36 634             Gene annotation of RGD
Arabidopsis gene               32 041             Gene annotation of TAIR
Rice gene                      29 389             Gene annotation of RAP-DB
Disease                        20 054             Online Mendelian Inheritance in Man (OMIM)
                               2037               Our original, manually collected data
                               12 131             ICD-10, International Statistical Classification of Diseases and Related Health Problems
Metabolite                     49 983             A comprehensive species-metabolite relationship database (KNApSAcK)
MEDLINE                        9 378 134          MEDLINE titles, abstracts and MeSH terms
Pathway information            3809               Pathway information from REACTOME
Protein-protein interaction    73 645             Protein-protein interactions in human and mouse from IntAct and in Arabidopsis from AtPID
Gene ontology                  12 787             Gene ontology data
Human disease ontology         2282               Human disease ontology data
Mammalian phenotype ontology   7440               Mammalian phenotype ontology data



All data sources and links to the original DBs are described at http://omicspace.riken.jp/Data/. 



PosMed data and its search index over the 10 million 
biological documents. 

Implementation 

PosMed is implemented as a web application that users
can access freely via their web browsers without logging in.
Any conventional web browser can be used and no
browser plug-in is needed. For Windows we recommend
Microsoft Internet Explorer 9 or later, Firefox 18
or later, or Google Chrome 24 or later. For Macintosh
we recommend Safari 5 or later, or Firefox 18 or later.

The web service is developed in Java and runs on 11
Linux servers: 10 distributed servers running
GRASE engines (6), which perform direct and inferential
searches in parallel, and one head server that acts
as both the Java Servlet user interface and the coordinator
that issues parallel search requests to the distributed
servers and combines their results to rank the resulting
data items. This architecture provides scalability, so a
search still completes in a few seconds even
though our data sets have been extended since our previous
publications (7,9).
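
The coordinator pattern described above, in which a head server fans a query out to several search servers and merges their results into one ranking, can be sketched as follows. The real service is written in Java and runs on separate machines. This Python sketch uses threads and in-memory dictionaries as stand-ins for the distributed servers, and the names search_shard and coordinate_search, the data layout and the P-values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, keyword):
    """Stand-in for one distributed search server.

    `shard` maps an item name to a dict of keyword -> P-value.
    Returns (item, p_value) pairs for items associated with the keyword.
    """
    return [
        (item, scores[keyword])
        for item, scores in shard.items()
        if keyword in scores
    ]


def coordinate_search(shards, keyword, max_workers=10):
    """Stand-in for the head server: query all shards in parallel,
    merge the partial hit lists and rank by ascending P-value."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_results = pool.map(search_shard, shards,
                                   [keyword] * len(shards))
        merged = [hit for hits in partial_results for hit in hits]
    return sorted(merged, key=lambda hit: hit[1])


if __name__ == "__main__":
    shards = [
        {"geneA": {"diabetes": 0.01}},
        {"geneB": {"diabetes": 0.001}, "geneC": {"obesity": 0.5}},
    ]
    print(coordinate_search(shards, "diabetes"))
```

Because each shard is searched independently, adding shards spreads a growing data set over more workers while the merge step stays a simple sort.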

Since the first launch of the PosMed service, users
have often asked us to support batch queries.
We do not support them
yet because of machine resource limitations (PosMed
takes from ~1 to several seconds per query). We plan to
support batch queries once additional machine
resources are secured.



DISCUSSION 

Since 2005, PosMed has been widely used to prioritize
candidate genes after Quantitative Trait Locus (QTL)
analysis in mice and to successfully identify responsible
genes (30). In this update, we added a file upload function
(described in Figure 4) to extend the application from
QTL analysis to exome sequencing studies. For data
content, we added three new databases and ontologies to
expand the PosMed inferential search to bioresources. These
data sets allow PosMed to discover bioresources by
phenotype, whereas most other databases support only
genetic information. We expect our work to promote the active
use of bioresources.

Table 1 (continued). References column, in row order: (5), (12), (13), (14), (15), (16), (17), (18), (7), (19), (20), (21), (22), (23,24), (25), (26), (27).

Several eminent databases have released RDF files, but
few scientists actively use Semantic Web
technology. This may be partly because bioinformaticians
prefer to calculate the statistical significance of associations
among RDF connections rather than perform a simple Boolean
retrieval of the connections. To address this problem, we
propose our original methodology, SWAS, for statistical
searching of biological Semantic Web data. PosMed
executes SWAS to rank significantly enriched groups of
biological resource data items that can be associated with
a user-specified query through the big data of MEDLINE,
omics data sets and other semantic data. Our
results confirm that such enrichment analysis using the
SWAS methodology is effective (31) and provide many
practical use cases of enrichment studies, including biological
resource ranking problems. In the coming era of
big-data-driven science, the SWAS methodology should
be added to SPARQL endpoint services worldwide so that
any user can execute enrichment studies over linked open
data distributed around the world.